
    A Note on (Parallel) Depth- and Breadth-First Search by Arc Elimination

    This note recapitulates an algorithmic observation for ordered Depth-First Search (DFS) in directed graphs that immediately leads to a parallel algorithm with linear speed-up for a range of processors for non-sparse graphs. The note extends the approach to ordered Breadth-First Search (BFS). With $p$ processors, both DFS and BFS algorithms run in $O(m/p+n)$ time steps on a shared-memory parallel machine allowing concurrent reading of locations, e.g., a CREW PRAM, and have linear speed-up for $p\leq m/n$. Both algorithms need $n$ synchronization steps.
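    A one-line check of the speed-up range (our derivation, not from the note itself): sequential DFS/BFS takes $O(n+m)=O(m)$ time on non-sparse graphs, so

    \[
        T_p = O(m/p + n) = O(m/p) \quad \text{when } p \leq m/n \text{ (i.e., } n \leq m/p\text{)},
        \qquad
        \frac{T_1}{T_p} = \frac{O(m)}{O(m/p)} = \Theta(p).
    \]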

    The Shortest Path Problem with Edge Information Reuse is NP-Complete

    We show that the following variation of the single-source shortest path problem is NP-complete. Let a weighted, directed, acyclic graph $G=(V,E,w)$ with source and sink vertices $s$ and $t$ be given. Let in addition a mapping $f$ on $E$ be given that associates information with the edges (e.g., a pointer), such that $f(e)=f(e')$ means that edges $e$ and $e'$ carry the same information; for such edges it is required that $w(e)=w(e')$. The length of a simple $st$ path $U$ is the sum of the weights of the edges on $U$, but edges with $f(e)=f(e')$ are counted only once. The problem is to determine a shortest such $st$ path. We call this problem the \emph{edge information reuse shortest path problem}. It is NP-complete by reduction from 3SAT.
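    To make the objective concrete, here is a small, hypothetical encoding of the path-length function (the paper does not prescribe a representation): edges are (weight, information) pairs, and edges carrying the same information contribute their common weight only once.

        def reuse_length(path_edges):
            """Length of an st path given as (weight, info) pairs; edges with
            equal info values contribute their (necessarily equal) weight once."""
            seen = {}
            for w, f in path_edges:
                seen.setdefault(f, w)  # w(e) = w(e') is required when f(e) = f(e')
            return sum(seen.values())

        # Example: the last two edges share info 'a', so their weight 2 counts once
        assert reuse_length([(5, 'b'), (2, 'a'), (2, 'a')]) == 7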

    On Optimal Trees for Irregular Gather and Scatter Collectives

    We study the complexity of finding communication trees with the lowest possible completion time for rooted, irregular gather and scatter collective communication operations in fully connected, $k$-ported communication networks under a linear-time transmission cost model. Consecutively numbered processors specify data blocks of possibly different sizes to be collected at or distributed from some (given) root processor where they are stored in processor order. Data blocks can be combined into larger segments consisting of blocks from or to different processors, but individual blocks cannot be split. We distinguish between ordered and non-ordered communication trees depending on whether segments of blocks are maintained in processor order. We show that lowest completion time, ordered communication trees under one-ported communication can be found in polynomial time by giving simple, but costly dynamic programming algorithms. In contrast, we show that it is an NP-complete problem to construct cost-optimal, non-ordered communication trees. We have implemented the dynamic programming algorithms for homogeneous networks to evaluate the quality of different types of communication trees, in particular to analyze a recent, distributed, problem-adaptive tree construction algorithm. Model experiments show that this algorithm is close to the optimum for a selection of block size distributions. A concrete implementation for specially structured problems shows that optimal, non-binomial trees can possibly have even further practical advantage.
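    Because blocks are consecutive and ordered trees keep segments in processor order, the problem lends itself to dynamic programming over block intervals. The sketch below is our illustrative simplification under the linear cost model (sending $b$ units costs $\alpha + \beta b$), assuming one-ported communication and ignoring serialization of multiple receives at a node; it is not the paper's exact recurrence.

        from functools import lru_cache

        def ordered_gather_time(sizes, alpha, beta):
            """Completion time of an ordered gather of consecutive blocks,
            splitting each interval once: the remote part is assembled
            recursively, then sent as one combined message (blocks are
            never split, only concatenated into segments)."""
            prefix = [0]
            for s in sizes:
                prefix.append(prefix[-1] + s)
            seg = lambda i, j: prefix[j + 1] - prefix[i]  # size of blocks i..j

            @lru_cache(maxsize=None)
            def T(i, j):
                if i == j:
                    return 0.0
                return min(
                    max(T(i, k), T(k + 1, j) + alpha + beta * seg(k + 1, j))
                    for k in range(i, j)
                )

            return T(0, len(sizes) - 1)

        # e.g., four blocks of unequal size, startup cost 1, per-unit cost 0.1
        print(ordered_gather_time([4, 1, 2, 3], 1.0, 0.1))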

    Simplified, stable parallel merging

    This note makes an observation that significantly simplifies a number of previous parallel, two-way merge algorithms based on binary search and sequential merge in parallel. First, it is shown that the additional merge step of distinguished elements as found in previous algorithms is not necessary, thus simplifying the implementation and reducing constant factors. Second, by pinning down the requirements on the binary search, the merge algorithm becomes stable, provided that the sequential merge subroutine is stable. The stable, parallel merge algorithm can easily be used to implement a stable, parallel merge sort. For ordered sequences with $n$ and $m$ elements, $m\leq n$, the simplified merge algorithm runs in $O(n/p+\log n)$ operations using $p$ processing elements. It can be implemented on an EREW PRAM, but since it requires only a single synchronization step, it is also a candidate for implementation on other parallel, shared-memory computers.
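    The following sketch (our illustration, with the parallel pieces sequentialized into a loop) shows the general scheme such algorithms share: one binary-search partitioning step splits the output into $p$ independent pieces, each finished by a stable sequential merge. Breaking ties in the binary search consistently in favor of the first sequence is the kind of requirement that makes the whole merge stable.

        def corank(k, a, b):
            """Find i (with j = k - i) so that a[:i] and b[:j] are exactly the
            first k output elements, ties taken from a first (stability)."""
            lo, hi = max(0, k - len(b)), min(k, len(a))
            while lo < hi:
                i = (lo + hi) // 2
                j = k - i
                if j > 0 and i < len(a) and b[j - 1] >= a[i]:
                    lo = i + 1      # a[i] must precede b[j-1]: take more from a
                elif i > 0 and j < len(b) and a[i - 1] > b[j]:
                    hi = i          # b[j] must precede a[i-1]: take less from a
                else:
                    return i
            return lo

        def seq_merge(x, y):        # stable sequential merge: ties favor x
            out, i, j = [], 0, 0
            while i < len(x) and j < len(y):
                if x[i] <= y[j]:
                    out.append(x[i]); i += 1
                else:
                    out.append(y[j]); j += 1
            return out + x[i:] + y[j:]

        def stable_parallel_merge(a, b, p):
            n = len(a) + len(b)
            ks = [t * n // p for t in range(p + 1)]
            iz = [corank(k, a, b) for k in ks]
            out = []
            for t in range(p):      # the p pieces are independent: one per PE
                out += seq_merge(a[iz[t]:iz[t + 1]],
                                 b[ks[t] - iz[t]:ks[t + 1] - iz[t + 1]])
            return out

        print(stable_parallel_merge([1, 3, 5, 7], [2, 3, 6], 3))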

    VieM v1.00 -- Vienna Mapping and Sparse Quadratic Assignment User Guide

    This paper serves as a user guide to the mapping framework VieM (Vienna Mapping and Sparse Quadratic Assignment). We give a rough overview of the techniques used within the framework and describe the user interface as well as the file formats used. (arXiv admin note: text overlap with arXiv:1311.171)

    Stamp-it: A more Thread-efficient, Concurrent Memory Reclamation Scheme in the C++ Memory Model

    We present Stamp-it, a new, concurrent, lock-less memory reclamation scheme with amortized, constant-time (thread-count independent) reclamation overhead. Stamp-it has been implemented and proved correct in the C++ memory model, using memory-consistency assumptions as weak as possible. We have likewise (re)implemented six other comparable reclamation schemes. We give a detailed performance comparison, showing that Stamp-it performs favorably compared to most of these other schemes, sometimes better and at least as well, while being able to reclaim free memory nodes earlier. (arXiv admin note: substantial text overlap with arXiv:1712.0613)

    On the State and Importance of Reproducible Experimental Research in Parallel Computing

    Computer science is also an experimental science. This is particularly the case for parallel computing, which is in a total state of flux, and where experiments are necessary to substantiate, complement, and challenge theoretical modeling and analysis. Here, experimental work is as important as advances in theory, which are indeed often driven by the experimental findings. In parallel computing, scientific contributions presented in research articles are therefore often based on experimental data, with a substantial part devoted to presenting and discussing the experimental findings. As in all of experimental science, experiments must be presented in a way that makes reproduction by other researchers possible, in principle. Despite appearances to the contrary, we contend that reproducibility plays a small role, and is typically not achieved. Articles often do not have a sufficiently detailed description of their experiments, and do not make available the software used to obtain the claimed results. As a consequence, parallel computational results are most often impossible to reproduce, often questionable, and therefore of little or no scientific value. We believe that the description of how to reproduce findings should play an important part in every serious, experiment-based parallel computing research article. We aim to initiate a discussion of the reproducibility issue in parallel computing, and elaborate on the importance of reproducible research for (1) better and sounder technical/scientific papers, (2) a sounder and more efficient review process, and (3) more effective collective work. This paper expresses our current view on the subject and should be read as a position statement for discussion and future work. We do not consider the related (but no less important) issue of the quality of the experimental design.

    A new and five older Concurrent Memory Reclamation Schemes in Comparison (Stamp-it)

    Memory management is a critical component in almost all shared-memory, concurrent data structures and algorithms, consisting of the efficient allocation and subsequent reclamation of shared memory resources. This paper contributes a new, lock-free, amortized constant-time memory reclamation scheme called \emph{Stamp-it}, and compares it to five well-known, selectively efficient schemes from the literature, namely Lock-free Reference Counting, Hazard Pointers, Quiescent State-based Reclamation, Epoch-based Reclamation, and New Epoch-based Reclamation. An extensive experimental evaluation with both new and commonly used benchmarks is provided, on four different shared-memory systems with hardware-supported thread counts ranging from 48 to 512, showing Stamp-it to be competitive with, and in many cases and aspects outperforming, the other schemes.

    More Parallelism in Dijkstra's Single-Source Shortest Path Algorithm

    Dijkstra's algorithm for the Single-Source Shortest Path (SSSP) problem is notoriously hard to parallelize in $o(n)$ depth, $n$ being the number of vertices in the input graph, without increasing the required parallel work unreasonably. Crauser et al. (1998) presented observations that allow identifying more than a single vertex at a time as correct, and correspondingly more edges to be relaxed simultaneously. Their algorithm runs in parallel phases, and for certain random graphs they showed that the number of phases is $O(n^{1/3})$ with high probability. A work-efficient CRCW PRAM algorithm with this depth was given, but no implementation on a real, parallel system. In this paper we strengthen the criteria of Crauser et al., and discuss tradeoffs between work and number of phases in their implementation. We present simulation results with a range of common input graphs for the depth that an ideal parallel algorithm, which can apply the criteria at no cost and parallelize relaxations without conflicts, can achieve. These results show that the number of phases is indeed a small root of $n$, but still off from the shortest path length lower bound that can also be computed. We give a shared-memory parallel implementation of the most work-efficient version of a Dijkstra's algorithm running in parallel phases, which we compare to our own implementation of the well-known $\Delta$-stepping algorithm. We can show that the work-efficient SSSP algorithm applying the criteria of Crauser et al. is competitive with and often better than $\Delta$-stepping on our chosen input graphs. Despite not providing an $o(n)$ guarantee on the number of required phases, criteria allowing concurrent relaxation of many correct vertices may be a viable approach to practically fast, parallel SSSP implementations.
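    For intuition, here is a sequential simulation (our sketch) of phase-parallel Dijkstra using a simplified form of the OUT criterion of Crauser et al.: every queued vertex $v$ with $\mathrm{tent}(v) \leq \min_u(\mathrm{tent}(u) + \mathrm{minout}(u))$, the minimum taken over all queued $u$, already has its final distance and can be settled in the same phase.

        import math

        def sssp_phases(adj, s):
            """Phase-parallel Dijkstra (sequential simulation). adj maps every
            vertex to a list of (neighbor, weight) pairs, weights nonnegative.
            Returns tentative distances and the number of phases used."""
            minout = {u: min((w for _, w in adj[u]), default=math.inf) for u in adj}
            tent = {u: math.inf for u in adj}
            tent[s] = 0.0
            queued, phases = {s}, 0
            while queued:
                phases += 1
                threshold = min(tent[u] + minout[u] for u in queued)
                settled = {u for u in queued if tent[u] <= threshold}
                queued -= settled
                for u in settled:       # relaxations of settled vertices could run
                    for v, w in adj[u]: # in parallel (conflicts on a common target
                        if tent[u] + w < tent[v]:  # would need priority updates)
                            tent[v] = tent[u] + w
                            queued.add(v)
            return tent, phases

        adj = {'s': [('a', 1), ('b', 4)], 'a': [('b', 2), ('c', 5)],
               'b': [('c', 1)], 'c': []}
        print(sssp_phases(adj, 's'))  # ({'s': 0.0, 'a': 1.0, 'b': 3.0, 'c': 4.0}, 4)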

    $k$-ported vs. $k$-lane Broadcast, Scatter, and Alltoall Algorithms

    In $k$-ported message-passing systems, a processor can simultaneously receive $k$ different messages from $k$ other processors, and send $k$ different messages to $k$ other processors that may or may not be different from the processors from which messages are received. Modern clustered systems may not have such capabilities. Instead, compute nodes consisting of $n$ processors can simultaneously send and receive $k$ messages from other nodes, by letting $k$ processors on the nodes concurrently send and receive at most one message. We pose the question of how to design good algorithms for this $k$-lane model, possibly by adapting algorithms devised for the traditional $k$-ported model. We discuss and compare a number of (non-optimal) $k$-lane algorithms for the broadcast, scatter and alltoall collective operations (as found in, e.g., MPI), and experimentally evaluate these on a small $36\times 32$-node cluster with a dual OmniPath network (corresponding to $k=2$). Results are preliminary.
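    A rough intuition for carrying algorithms between the two models (our simplification, not from the paper): if each informed unit can send to $k$ others per round, the informed set grows by at most a factor $k+1$ per round, which bounds the number of broadcast rounds from below in both models.

        import math

        def bcast_round_lower_bound(P, k):
            """Minimum number of communication rounds to broadcast among P
            units when each informed unit can inform at most k others per
            round (informed set grows by a factor of at most k + 1)."""
            return math.ceil(math.log(P) / math.log(k + 1))

        # e.g., 36 nodes with k = 2 ports or lanes: at least 4 rounds
        print(bcast_round_lower_bound(36, 2))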